metrics: add workload-level preemption count gauge by gyliu513 · Pull Request #11252 · kubernetes-sigs/kueue

gyliu513 · 2026-05-15T20:55:34Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

metrics: add workload-level preemption count gauge

netlify · 2026-05-15T20:55:39Z

✅ Deploy Preview for kubernetes-sigs-kueue ready!

Name	Link
🔨 Latest commit	`ac18d60`
🔍 Latest deploy log	https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/6a07beac3197b90009d606d7
😎 Deploy Preview	https://deploy-preview-11252--kubernetes-sigs-kueue.netlify.app

To edit notification comments on pull requests, go to your Netlify project configuration.

k8s-ci-robot · 2026-05-15T20:55:46Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gyliu513
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mimowo · 2026-05-18T18:10:56Z

cc @amy ptal

amy · 2026-05-18T19:03:26Z


+	// +metricsdoc:group=workload
+	// +metricsdoc:labels=namespace="the namespace of the preempted workload",workload="the name of the preempted workload",cluster_queue="the ClusterQueue of the preempted workload",reason="eviction or preemption reason"
+	WorkloadPreemptionsCount *prometheus.GaugeVec


If you want to model this on PreemptedWorkloadsTotal, please rename Count to Total.

amy · 2026-05-18T19:07:49Z

 		}, append([]string{"preempting_cluster_queue", "reason", "replica_role"}, extraLabels...),
 	)

+	WorkloadPreemptionsCount = prometheus.NewGaugeVec(


The Help text says: "Uses a Gauge (rather than a Counter) so that the time series can be deleted when the workload finishes or is removed, keeping cardinality bounded." That rationale is incorrect. CounterVec supports DeletePartialMatch in exactly the same way. In fact, ClearClusterQueueMetrics at pkg/metrics/metrics.go:1038 already deletes PreemptedWorkloadsTotal (a CounterVec) via DeletePartialMatch.

Using a Gauge for a value that only ever increments is a Prometheus anti-pattern: it breaks rate()/increase() and the reset-detection semantics those functions rely on. Operators who try to chart preemption rate per workload will get wrong answers.

Please switch to CounterVec and update the help text. The variable should also be renamed accordingly (WorkloadPreemptionsTotal, with metric name workload_preemptions_total), matching both Prometheus naming convention and PreemptedWorkloadsTotal for ClusterQueue.
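For readers unfamiliar with the reset-detection semantics mentioned above, here is a minimal sketch in plain Go. It is a simplified model, not Prometheus code: the function name `increase` and the sample values are illustrative assumptions, and the real `increase()` also extrapolates to the window boundaries, which is omitted here. It only shows the core rule that any drop between consecutive samples is treated as a restart from zero.

```go
package main

import "fmt"

// increase is a simplified, hypothetical model of counter-style reset
// handling: it sums the deltas between consecutive samples, and when a
// sample is lower than its predecessor it assumes the series restarted
// from zero, so the new value counts in full.
func increase(samples []float64) float64 {
	total := 0.0
	for i := 1; i < len(samples); i++ {
		delta := samples[i] - samples[i-1]
		if delta < 0 {
			// A drop is interpreted as a reset to zero.
			delta = samples[i]
		}
		total += delta
	}
	return total
}

func main() {
	// Monotonic series: 1 -> 5 is an increase of 4.
	fmt.Println(increase([]float64{1, 2, 3, 5}))
	// Series that drops after 5: 2 (3->5) + 1 (reset, then 1) + 1 (1->2).
	fmt.Println(increase([]float64{3, 5, 1, 2}))
}
```

Under this model, a series is only interpreted correctly if a decrease really does mean "the process restarted", which is the contract a counter type documents and a gauge type does not.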

amy · 2026-05-18T19:08:45Z

+Uses a Gauge (rather than a Counter) so that the time series can be deleted when the workload finishes or is removed, keeping cardinality bounded.
+This gauge is deleted when the workload is completed or deleted.`,
+		}, []string{"namespace", "workload", "cluster_queue", "reason"},
+	)


Please mirror PreemptedWorkloadsTotal. Add replica_role and extraLabels.

amy · 2026-05-18T19:11:08Z

 		EvictedWorkloadsTotal,
 		EvictedWorkloadsOnceTotal,
 		PreemptedWorkloadsTotal,
+		WorkloadPreemptionsCount,


As an example, per-LocalQueue metrics are gated by LocalQueueMetrics: pkg/metrics/metrics.go:1344, RegisterLQMetrics().

Per-workload metrics have even higher cardinality. Please add a feature gate for per-workload metrics.
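A rough sketch of what gating could look like, in plain Go with no Kueue or Prometheus dependencies. Everything here is an assumption made for illustration: the gate name `PerWorkloadMetrics`, the `featureGates` type, and the `metricsToRegister` helper do not come from the PR, and the actual change would follow the existing RegisterLQMetrics() pattern instead.

```go
package main

import "fmt"

// featureGates is a hypothetical stand-in for a feature gate lookup.
type featureGates map[string]bool

// perWorkloadMetrics is a hypothetical gate name, not one defined by Kueue.
const perWorkloadMetrics = "PerWorkloadMetrics"

// metricsToRegister returns the names of the collectors that would be
// registered. The per-workload collector is only included when the gate
// is enabled, so clusters that leave it off pay no cardinality cost.
func metricsToRegister(gates featureGates) []string {
	names := []string{"EvictedWorkloadsTotal", "PreemptedWorkloadsTotal"}
	if gates[perWorkloadMetrics] {
		names = append(names, "WorkloadPreemptionsTotal")
	}
	return names
}

func main() {
	fmt.Println(metricsToRegister(featureGates{}))
	fmt.Println(metricsToRegister(featureGates{perWorkloadMetrics: true}))
}
```

Keeping the check at registration time means a disabled gate never creates the collector at all; the code paths that record or clear the metric would need the same guard.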

amy · 2026-05-18T19:14:24Z


 	finishedCond := apimeta.FindStatusCondition(wl.Status.Conditions, kueue.WorkloadFinished)
 	if finishedCond != nil && finishedCond.Status == metav1.ConditionTrue {
+		metrics.ClearWorkloadPreemptionMetrics(wl.Namespace, wl.Name)


When workloadRetention.afterFinished is configured, the workload object is retained for some duration after it finishes, but this call wipes its preemption history immediately on the first finished reconcile. An operator inspecting a retained finished workload (which is the whole point of the retention feature) will see zero preemptions even if it was preempted many times.

Move the metrics.ClearWorkloadPreemptionMetrics(...) call into the branch where we actually delete the workload (line 268, just before/after r.client.Delete), and rely on the Delete event handler at line 1141 for the no-retention path. That way the metric lifetime matches the object lifetime.

k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. labels May 15, 2026

gyliu513 marked this pull request as draft May 15, 2026 20:55

k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 15, 2026

k8s-ci-robot requested review from PBundyra and windsonsea May 15, 2026 20:55

k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 15, 2026

metrics: add workload-level preemption count gauge

ac18d60

gyliu513 force-pushed the metric branch from acfd19b to ac18d60 Compare May 16, 2026 00:47

gyliu513 marked this pull request as ready for review May 16, 2026 00:47

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 16, 2026

k8s-ci-robot requested a review from kshalot May 16, 2026 00:47

amy reviewed May 18, 2026

gyliu513 wants to merge 1 commit into kubernetes-sigs:main from gyliu513:metric
